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Abstract: One of Pranab K. Sen's major research areas is sequential non- 
parametrics and semiparametrics and their applications to clinical trials, to 
which he has made many important contributions. Herein we review a number 
of these contributions and related developments. We also describe some recent 
work on nonparamctric and semiparamctric inference and the associated com- 
putational methods in time-sequential clinical trials with survival endpoints. 



1. Introduction 

Sequential nonparametrics began in the 1960s with the work of Wilcoxon, Rhodes 
and Bradley [68] on extending Wald's [66] sequential probability ratio test (SPRT) 
to construct two-sample grouped rank-sum tests, and with Savage and Sethura- 
man's [50] invariant SPRT of Hq : F = G versus a single Lehmann alternative 
Hi : F = G 5 in the two-sample problem associated with population distribution 
functions F and G. It was soon recognized that natural hypotheses in conventional 
(i.e., fixed sample size) nonparametrics could not be handled by invariant SPRTs, 
e.g., F = G s with fixed 5 ^ 1 is highly artificial. An alternative approach is to make 
use of the weak convergence of the sequence of suitably normalized nonparamet- 
ric test statistics under the null hypothesis and under contiguous alternatives to 
Brownian motion, with drift under Hq and drift 8 under a contiguous alternative, 
so that the nonparamctric testing problem is asymptotically equivalent to that of 
testing whether the drift of Brownian motion is or 9. Sen and Ghosh [57], Sen 
[53, 54], Ghosh and Sen [18], Hall and Loynes [24] and Lai [34, 35] were some of the 
earliest applications of this approach. Chapters 9 and 11 of Sen's [56] monograph 
give an overview of these and subsequent developments in nonparametric sequential 
testing. Chapter 10 reviews related developments in nonparametric sequential in- 
terval and point estimation, while Chapters 2-8 provide the basic weak convergence 
theory 

Besides marking the publication of Sen's monograph, the year 1981 also wit- 
nessed another important event in sequential nonparametrics and its applications 
to clinical trials. The early termination of the Bcta-Blocker Heart Attack Trial 
(BHAT) in October 1981 prior to its prcschcdulcd end nine months later drew im- 
mediate attention of the medical community to the benefits of sequential methods 
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and spurred important advances and new developments in sequential nonparamet- 
rics and semiparametrics in the following two decades, which are reviewed in Section 
2. To tackle the complexity of time-sequential clinical trials with staggered patient 
entry and failure-time endpoints, new weak convergence results and asymptotic ex- 
pansions were developed by making use of powerful tools involving continuous-time 
martingales, stochastic integrals and empirical process theory. Section 3 summa- 
rizes these and other results under the rubric of functional central limit theorems 
(CLT), also called "invariancc principles" as in Sen's monograph, in sequential non- 
parametrics and semiparametrics. 

In applying the asymptotic theory to the design and analysis of time-sequential 
trials with failure-time endpoints, several implementation issues need to be ad- 
dressed and are discussed in Section 4. For example, the asymptotic approximations 
may be inadequate for the sample size considered, or may be difficult to compute 
directly. Another long-standing problem is related to terminal analysis of the trial. 
How can valid nonparametric/semiparametric confidence intervals be constructed 
for the primary and secondary endpoints when early stopping may occur during 
interim analysis of the trial? This problem is addressed in Section 4 which makes 
use of recent developments in Monte Carlo and resampling methods to resolve dif- 
ficulties in the design and analysis of complex time-sequential trials. 

2. Time-sequential rank tests in clinical trials 

In a typical clinical trial to compare times to failure between two treatment groups 
X and Y, n patients enter the trial at different times during an accrual period, are 
randomly assigned to treatment X or Y and are then followed until they fail or 
withdraw from the study or until the study is terminated. In particular, BHAT was 
a multicentcr, double-blind, randomized, placebo-controlled clinical trial designed 
to test the efficacy of long-term therapy with propranolol given to survivors of 
an acute myocardial infarction. The trial was scheduled for 4 years, with reviews 
of the data by a Data and Safety Monitoring Board (DSMB) at 11, 16, 21, 28, 
34, 40 and 48 months. The trial design assumed an accrual rate of 149 patients 
per month for a period of 27 months, so the planned total number of patients 
was 4123. Another assumption is that each patient is randomized to placebo or 
treatment upon entering the trial, and is followed for a maximum of 3 years. The 
actual recruitment period was 27 months, within which 3837 patients were accrued 
from 136 coronary care units in 31 clinical centers, with 1916 patients randomized 
into the propranolol group and 1921 into the placebo group. Successive values of 
the standardized logrank statistics (see the next paragraph) at 11, 16, 21, 28, 34 
and 40 months when the DSMB met were 1.68, 2.24, 2.37, 2.30, 2.34 and 2.82, 
respectively. Instead of continuing the trial to its scheduled end at 48 months, the 
DSMB recommended terminating it in their last meeting because of conclusive 
evidence in favor of propranolol. To adjust for repeated significance testing, the 
DSMB used critical values provided by O'Brien and Fleming's [43] method for 
normal observations; see Section 2.2. Since logrank statistics (rather than normal 
means) were actually used, the DSMB also appealed to Tsiatis' [62] result on the 
joint asymptotic normality of time-sequential logrank statistics. 

To describe a typical time-sequential trial with staggered patient entry, we first 
introduce the following notation. Let T[ > denote the entry time and Xi > the 
survival time (or time to failure) after entry of the ith subject in treatment group 
X, and let T'. 1 and Yj denote the entry time and survival time after entry of the jth 
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subject in treatment group Y, 1 < i < n', 1 < j < n" . Let n = n' + n". Thus the 
data at calendar time t consist of (Aj(i), <5-(i)), i = 1, . . . ,n', and (Yj (t) 7 S'J (t)) 7 j = 
1, . . . , n", where X t (t) = X t A £ A (t - T!)+,Yj(t) = Yj A ?j A (t - T/)+, 6<(t) = 

^{X 4 (i)=Xi} 5 ^'(*) = ^{y J (t)=Y J }, rn' n j(s) = J27=i 7 {X 4 (*)>*}> ™",t( s ) = 
2^—1 -^{y}(t)> s }) an d Ci(^j') denotes the withdrawal time, possibly infinite, of the 
iih (jth) subject in treatment group X(Y). At a given calendar time t, one can 
compute, on the basis of the observed data from the two treatment groups, a rank 
statistic of the general form considered by Tsiatis [63] : 

^ , ( m'AXAt)) ] 

fri { <»^(t))+< t Wt))J 

„ mi t (K(t)) 

where Q n (t,s) is some weight function satisfying certain measurability assump- 
tions. Let H Ut t denote a product-limit-type estimator of the common distribu- 
tion function of the two treatment groups under the null hypothesis, based on 
{(Xi(t),Si(t),Yj(t),Sj(t)) : i < n',j < n"}. Note that this setting is considerably 
more complicated than the progressively censored case considered by Chatterjee 
and Sen [9] and Sen [55, 56] in the context of life testing experiments. As will be re- 
viewed in Section 3, for a general weight function of the form Q n (t, s) — ip(H n , t (s)) 
in (1), {S n (t)/^n 7 1 > 0} converges weakly to a Gaussian process with independent 
increments and variance function V(t) under the null hypothesis, and contiguous 
alternatives. Gu, Lai and Lan [22] have pointed out that these time-sequential rank 
statistics based on censored data are natural extensions of classical rank statistics 
in fixed sample size (FSS) tests of Hq : F = G, where F is the common distribution 
function of the Xi and G is that of the Yj, in which Xi and Yj are completely 
observable (i.e., there is no censoring). Letting Ri denote the rank of Xi in the 
combined sample {Xi, . . . , X n r, Yj., . . . , Y n "}, a classical rank statistic has the form 
X)"=i ^(Ri/^j where (p : (0, 1] — > 1Z is a score function, and its extension to cen- 
sored time-sequential data has the form (1) with Q n (t,s) = tp(H nt t(s)), with ip 
related to ip via the relation 

ip(u) = ip(u) - (1 - uy 1 j cp(t)dt, 0<u<l; (2) 

J u 

see Gu, Lai and Lan [22]. Taking ip(u) = (1 — u) p (p > 0) yields the G p statistics 
proposed by Harrington and Fleming [25]. The case p = corresponds to Mantel's 
[42] logrank statistic whose corresponding tp is the asymptotically optimal score 
function for testing Lehmann alternatives. The case p = 1 corresponds to the gen- 
eralization of Wilcoxon's statistic by Peto and Peto [44] and Prentice [48]. In the 
remainder of Section 2 it will be assumed that Q n (t, s) = - 0(iJ„ ! t(s)), unless stated 
otherwise. 

The mean function of the limiting Gaussian process associated with S n {t)/\/n is 
under the null hypothesis Hq : F = G and is of the form fi g (t) under contiguous 
alternatives that satisfy 

f 1^- l|dAj? = 0(-U (*) - 1} - ffOO (3) 

Jo dA F v n dA F 
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asm oo, uniformly over closed subintervals of {s £ [0, t*] : F(s) < 1}, where Ap 
and Aq are the cumulative hazard functions of F and G; see Section 3. Moreover, 
H g (t) = V(t) when ip is an asymptotically optimal score function of the form ip(-) = 
g(F (■)). In practice, the actual alternatives are unknown and u g need not even be 
monotone when ip is not optimal for the actual alternatives, such as using logrank 
statistics for non-proportional hazards (i.e., not Lchmann) alternatives, which allow 
repeated significance tests based on S n (t) to achieve both savings in study duration 
and increase in power over the fixed-duration test based on S n (t*); see Example 1 
of Gu and Lai [20]. 



2.1. Estimation ofV(t) and two time scales 



An estimate of the variance of S n (t) under Ho can be used to estimate the variance 
V(t) of the limiting Gaussian approximation to S n (t)/\/n under Ho and under 
contiguous alternatives. Two commonly used variance estimates are 



,* ib 2 (H nt (s))m' Js)m"Js) 



F ''( f '= f r7TnTTi«(K,.W) ! ««(«) + «.(»))'«:»)» (i*) 

JO l m n,il s J + m n : t( s )) 

where N^ t = £)£=i I{ Xi <^(t-T;)+As}, = YJj=i I {Y ] <q A{t-T^>)+ r\s}- As a 

compromise between these two choices, Gu and Lai [19] also considered 

V n (t) = {(4a) + (4&)}/2. (4c) 

For all three estimates, n _1 T^(i) converges in probability to V(t) under Ho and 
under contiguous alternatives. Hence, letting v = n ^V n (t) and W(v) = n~^ 2 S n {t), 
we can regard W(v),v > 0, as the standard Wiener process under Ho- Moreover, 
if ip is a scalar multiple of the asymptotically optimal score function, then we can 
also regard W(v),v > 0, as a Wiener process with some drift coefficient under 
contiguous alternatives. 

As pointed out by Lan and DeMets [39], there are two time scales in time- 
sequential trials with failure-time endpoints. One is calendar time t and the other 
is information time V n (t), and there is no simple relation between them. The cal- 
endar times of interest are those when interim (including final) analyses are per- 
formed and they are usually specified, at least approximately, in the trial protocol. 
The information time V n (t) is typically unknown before time t unless restrictive as- 
sumptions are made a priori. The two time scales have led to several long-standing 
difficulties in applying the asymptotic theory to repeated significance tests based 
on time-sequential rank statistics and to interval estimation following these tests. 
These difficulties have been resolved in the past five years, as reviewed in Sections 
2.2 and 4.2. 



2.2. Stopping boundaries for time- sequential rank tests 

Let < t\ < ■ ■ ■ < tfc = t* be prescribed times for periodic reviews of the data. To 
test Ho : F = G with the time-sequential rank statistics S n (ti), Slud and Wei [61] 
introduced the following simple approach. First choose positive numbers ax, ■ • • , ot^ 
such that a, = a {= the overall significance level). Then use the multivariate 
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normal approximation to the null distribution of (S n (ti)/V n (ti))i<i<k to deter- 
mine di , . . . , dk recursively by 

PH {\S n (tj)\/V^ 2 (tj) > dj and \S n (t t )\/V^ 2 (U) < d t for all i < j} = ctj. (5) 

With the dj thus determined, the Slud-Wei repeated significance test rejects Ho 

1/2 

whenever |SVi(^)| > djV n (tj)(l < j < k) and stops the trial at the first tj this 
occurs (or at t* if this does not occur for 1 < j < k). 

The Slud-Wei method does not provide practical guidelines concerning how the 
ctj in (5) should be chosen. Lan and DeMets [38] and Lan, DeMets and Halpcrin 
[40] proposed to derive the ctj from an error spending function, which specifies how 
fast we can spend the Type I error a over time. To begin with, let {W(v) 7 < v < 
1} be the standard Wiener process and consider the stopping rule T = inf{v <G 
[0, 1] : |W(i;)| > h(v)}(hxf = oo), where h is a positive function on [0,1] such that 
P{T = 0} = and P{T < 1} = a. The error spending function is A(v) = P{T < 
w},0 < v < 1. Taking v to represent the proportion of information accumulated 
at time t, with A(Q) = and A(l) = a. In particular, suppose that instead of 
survival data one has immediate responses from the patients who enter the study 
serially and are randomized to cither treatment, with a target sample size of n at 
the scheduled end of the trial. Lan and DeMets [39] call such trials "maximum 
information trials" . Here the proportion of information accumulated at time ti of 
interim analysis Vi = fii/n, where rii is the total number of patients available at U. 
Hence Lan and DeMets [38] proposed to choose ctj = A(vj) — A(vj-i) in (5). For 
the time- sequential rank statistics (1) in what Lan and DeMets [39] call "maximum 
duration trials", the asymptotic null variance of S n (U) is no longer proportional to 
the sample size rn at ti and a natural analogue oini/n here is V n (U) /V n (t*) . 

Let Z\, Z2, ■ ■ ■ be i.i.d. normal random variables with unknown mean 6 and known 
variance 1. To test H : = at level a, the Neyman-Pcarson test rejects H if 
I Ei=i %i\ — z aVk, where 1 — = a. Sample size calculation in clinical trial 

applications typically assume an alternative 9 of particular interest and find the 
k that attains some given power 1 — j3 at 6. The basic idea behind Haybittlc's 
[26] repeated significance test is to keep k and a as the maximum sample size 
and significance level but to allow for early stopping when the data are monitored 
sequentially, at the expense of some minor loss in power at 8. This leads to the 
stopping rule Tf, = min(/c,inf{n > 1 : 1X^=1-^1 — ^V 7 "}) an d terminal decision 
rule that rejects H if t& < k or if Tf, = k and | Xa=i %i\ ^ cVk. Since we require the 
loss in power at to be small relative to the fixed sample size test and also require 
the maximum sample size to be the same as the fixed sample size k, it is clear that 
c has to be near z a , implying that Po( r fc < k) is small in comparison with a. In 
particular, the Haybittle-Peto method [26, 45] in the field of clinical trials uses some 
relatively large value of 6, such as 3, and conventional critical values of c for the 
final test when the number k of interim analyses is small. The fact that Po(t& < k) 
is typically small relative to a (or equivalently that most of the Type I error is 
to be spent at the terminal data t*) suggests that using an elaborate Lan-DeMets 
boundary determination procedure would not lead to substantial improvement over 
a simple procedure of the Haybittle-Peto type. Lai and Shih [37] recently developed 
a theory of group sequential tests, based on Zi, Z2, ■ ■ ■ , for the parameter 9 of 
an exponential family of densities fg(z) = e 6z ~^^ with respect to some measure 
on the real line. This theory yields the following modified Haybittle-Peto test as 
an asymptotically optimal solution. Let S n = Z\ + ■ • • + Z n . To test the one-sided 
hypothesis H : 9 < 9 at significance level a, suppose no more than M observations 
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are to be taken. The fixed-sample-size test that rejects Hq if Sm > c a has maximal 
power at any alternative 6 > Oq, in particular at the alternative 9(M) "implied" 
by M (in the sense that M can be derived from the assumption that the above 
fixed-sample-size test has some prescribed power 1 — a at 0(M)). Although the 
protocol of a clinical trial typically justifies its choice of sample size by stating 
some conventional level (such as 80% or 90%) at a specified alternative, one often 
does not have much information prior to the trial to come up with a realistic 
alternative. Under the constraint of M on the sample size, it is desirable to adapt 
to the information on the actual 9 gathered during the course of the trial, allowing 
early stopping at times of interim analysis so that the test has nearly optimal power 
and expected sample size properties. 

To achieve these goals in a group sequential test with k groups and group sizes 
Tii, /la — th, . . . , Tifc — rifc_i so that rife = M, Lai and Shih [37] use a rejection region 
of the form S nk > c at the k th analysis, where c > c a but c does not differ much 
from c a . For the first k — 1 analyses, they use a stopping region of the form 

9 ni > 6*o and nil(9 ni ,9 ) > b, or (6a) 

e ni <6{M) and nJ(9 riz ,9(M)) >b, (66) 

for 1 < i < k — 1, where 1(9, X) denotes the Kullback-Lciblcr information number 
E g {\og[f s (Zi) / fx(Zi)}} . If (6a) holds, reject H upon stopping. If stopping occurs 
with (6b), accept H . In case stopping does not occur in the first k — 1 analyses, 
reject H if S 7lk > c. The thresholds b, b and c are so chosen that Pg (Test rejects 
Hq)= a and the power of the test at 9(M) does not differ much from its upper 
bound 1 — a. In the special case of 9 a — and normal Zi with mean 9 and vari- 
ance 1, tp(6) = 9 2 /2 and 1(9, X) = (9 — A) 2 /2, so the test reduces to that in the 
preceding paragraph. Previous works on efficient group sequential tests have used 
the expected sample size at an alternative, or more generally a weighted average of 
expected sample sizes over a set of parameter values, as the optimization criterion 
while controlling the error probabilities under the null hypothesis and a specified 
alternative at prescribed levels; see Pocock [47], Wang and Tsiatis [67], Kim and 
DeMets [32], Eales and Jennison [15] and Barber and Jennison [4]. There are several 
practical difficulties with this approach to efficient group sequential design. First, 
even though the mean of the random sample size is minimized at some alternative, 
the maximum sample size can be substantially larger than the mean and also the 
fixed sample size. Secondly, the optimization problem requires precise specification 
of the relative sizes of all groups, e.g. equal group sizes, but it is often not feasi- 
ble to do so prior to the trial because interim analyses are usually scheduled at 
calender times for administrative reasons. Thirdly, it may be difficult to come up 
with a realistic alternative before data are collected from the trial, but the opti- 
mization problem depends on the chosen alternative. Clearly efficiency of a group 
sequential test depends not only on the choice of the stopping rule but also on 
the test statistics used. To decouple these two issues, Lai and Shih [37] consider 
the one-parameter exponential family, for which sufficient statistics are the sam- 
ple means that are maximum likelihood estimators of ip'(9). Although the normal 
case is usually chosen to be the prototype in the group sequential literature, Lai 
and Shih choose the exponential family because linearity of i/j' in the normal case 
obscures the general form of (nearly) optimal test statistics and stopping bound- 
aries. Making use of Hocffding's [27] lower bound for the expected sample size of a 
test that has Type I error probability a at 9$ and Type II error probability a at 
9(M), they establish the asymptotic efficiency of the modified Haybittle-Peto test 
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by showing that the expected sample size of the test under Pg attains Hoeffding's 
lower bound asymptotically as a — > and a — ► 0. They have also carried out ex- 
tensive simulation studies of the expected sample size and power function of the 
modified Haybittle-Peto test in the N(6, 1) case, comparing it with different classes 
of group sequential tests that have been developed primarily for this normal setting 
in the literature. Their numerical results demonstrate the advantages of the mod- 
ified Haybittlc-Pcto test, which is flexible and efficient and can "self-tune" to the 
unknown parameters during the course of the trial, under prespecified constraints 
on the maximum sample size and Type I error probability. 

The preceding modified Haybittle-Peto test has in fact been proposed earlier by 
Gu and Lai [20] for repeated significance testing, which involves a maximum of 
k significance tests, based on the time-sequential censored rank statistics (1). Gu 
and Lai proposed to determine b such that Pq(ti, < k) = ea, where < e < 1 
is small and tj, is the stopping rule defined above for normal Z^. With b thus 
chosen, their repeated significance test stops the trial and rejects H at ti < t* if 

1/2 

|Sn(*j)| > bV n (ti). If the trial proceeds to the terminal date t* , their test rejects 
l li 

H if |5 n (i*)| > cV n (t*), where c is so chosen that 

P{\W{V n {t k ))\ > cVV\t k ) or \W(V n (U))\ > bV^(U) 

for some i < k\V n (h), V n (t k )} = a, (7) 

in which t k = t* and {W(v),v > 0} is a standard Wiener process independent of 
5$, CI',!?'),* > 1}. Letting a, = V n {t ) and d k = c,dj = b for 1 < j < 
k — 1, the probability (7) can be written as a sum of the probabilities P{|M / (a J )| > 
djJaj and W(ai) < di^/al for all i < j}, which can be computed by the recursive 
numerical integration algorithm of Armitage, McPherson and Rowe [3]. The choice 
of c in (7) is not predetermined at the beginning of the trial but depends on the 
actual values of V n (ti), . . . , V n (tk), allowing great flexibility in how information 
accumulates at different times of interim analyses. This modified Haybittle-Peto 
time-sequential test based on (1), with t restricted to the set {ti, . . . , t k } of calendar 
times at which interim analyses are conducted, yields an efficient stopping rule that 
circumvents the difficulty of "calendar time" versus "information time" in the error 
spending approach, which Scharfstein and Tsiatis [51] proposed to address by using 
simulations at each interim analysis to estimate the maximum information under 
the null hypothesis. 

2.3. Adjustments for other covariates in testing treatment effects 

It is widely recognized that tests of treatment effects based on the rank statistics (1) 
may lose substantial power when the effects of other covariates are strong. In nonse- 
quential trials, a commonly used method to remedy this when logrank statistics are 
used is to assume the proportional hazards regression model and to use Cox's par- 
tial likelihood approach to adjust for other covariates. Tsiatis, Rosner and Tritchlcr 
[65], Gu and Ying [23] and Bilias, Gu and Ying [5] have developed group sequential 
tests using this approach. Instead of relying on the proportional hazards model to 
adjust for concomitant variables, it is useful to have other methods for covariate 
adjustment, especially in situations where other score functions than the logrank 
are used in (1) to allow for the possibility of non-proportional hazards alterna- 
tives. Lin [41] and Gu and Lai [20] have proposed alternative covariate adjustment 
methods based on rank estimators and M-estimators in accelerated failure time 
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models. In the absence of more efficient computational schemes, these estimators 
took much longer to compute than those based on the proportional hazards model. 
The situation has subsequently improved with the algorithm of Jin et al. [29] for 
rank estimators and Kim and Lai [33] for M-estimators. 

3. Functional CLT in sequential non-(or semi-)parametrics 

Let F n be the empirical distribution function of i.i.d. random variables X\, X%, ■ ■ ■ 
with common continuous distribution function F, and let Fjj(u) = u for < u < 1, 
the distribution function of a uniform random variable. Whereas the weak con- 
vergence of y/n(F n o F^ 1 — Fjj) to Brownian bridge, when applied in conjunction 
with the functional delta method, provides a basic tool for deriving asymptotic 
normality of nonparametric/semiparametric statistics, sequential nonparametrics 
and semiparametrics involve an additional time parameter, which results in weak 
convergence to Gaussian random fields (i.e., multiparameter processes). Sen's [56] 
monograph therefore focuses on weak convergence in C([0, l] fe ) or D([Q, l] k ) with 
k > 2. In particular, [nt] (Fi nt -i o F^ 1 — Fu)/^/n converges weakly in D([0, l] 2 ) to a 
Kicfcr process K(t, s) whose covariancc function is given by 

Cov(K(t, s),K(t', s')) = (t A t')(s A s' - as'), 

i.e., K (•, s) is Brownian bridge for every fixed s and K(t, •) is Brownian motion for 
every fixed t. An alternative approach is to represent the nonparametric statistics 
(e.g. /7-statistics, rank statistics and linear combinations of order statistics) in terms 
of partial sums S n of i.i.d. random variables plus negligible remainders; see Lai 
[34, 35]. 

When the Xi are not fully observable because of censoring by £j, it is more conve- 
nient to work with the cumulative hazard function A(:r) = J^ oo (l—F(s—))^ 1 dF(s), 
which can be consistently estimated by the nonparametric maximum likelihood es- 
timator A n (x) = J_ oo (m n (s))~ 1 dN n (s) 1 where 

n n 
i=X i=l 

Making use of the key property that {N n (s) — f ^ m„{t) dA(t), s > 0} is a square 
integrable martingale with predictable variation process m n (t)dA(t), we can 
apply martingale central limit theorems to prove asymptotic normality of statistics 
of the form Q n (s) dA n (s), where Q n (-) is a predictable process. The monograph 
by Andersen et al. [1] summarizes this martingale approach to functional central 
limit theorems for nonparametric and semiparametric statistics based on censored 
data and their applications. 

This martingale approach can be easily extended to the vector of the time- 
sequential rank statistics (S n (ti), . . . , S n (tk))/y/n involving a fixed set of calendar 
times ti, . . . , tk, where S n (t) is defined in (1), as shown by Tsiatis [62, 63]. However, 
proving weak convergence of the continuous-time process {n~ 1 / 2 S n (t) 1 < t < 
t*}, where t* is the terminal tie of the study, is much more difficult even though 
it involves only checking an additional tightness condition. By combining certain 
maximal inequalities for continuous-time martingales with empirical process theory, 
Gu and Lai ([19], Lemma 2 and Appendix) established the desired tightness under 



340 



T. L. Lai and Z. Su 



some weak conditions on the weights Q„ (t, s) in (1) and the assumptions that 

m 

b'(t,s)= lim m^VPU^M-T^s}, 

m. — >rvi — * 



b"(t, s) = lim m- 1 V P{£' > s, t - T'! > s} 

3 = 1 

exist and are continuous in < s < t, that 

n /n — > 7 as n — > oo with < 7 < 1, 

and that F and G are continuous. Instead of S n (t), they considered S n (t, s) which 
is defined by (1) but with Ei< t <„' and Ei<j<n» replaced by Y,t-.xdt)<s and 
Y^,j:Y (t)<s- ^ n particular, for the case Q n {t, s) = ip(H n (t, s)) such that (1 — x) /S iIj(x) 
is a function of bounded variation on [0, 1] for some < (3 < |, their weak con- 
vergence theory for the random field {S n (t, s),0 < t < t* , < s < t*} yields the 
following results for S n (t): 

(i) For fixed F and G, {n, _1 / 2 (S'„(t) — fi n (t)),0 < t < t*} converges weakly in 
D[0, t*] to a zero-mean Gaussian process and n -1 yu n (t) converges in proba- 
bility as n — > 00, where 

/"* m' ,(s)m" ,(s) 

M* = / V> (H n ,t (S ) "f ^ V / (rfA F S - dAg 8 ■ 

Jo rn nt {s)+m nt {s) 

(ii) Let {Z(i),0 < t < t*} denote the zero-mean Gaussian process in (i) when 
F = G. This Gaussian process has independent increments and 

Jo W{t,s) + (l-7)6"(i,s) 

(hi) For fixed F (and therefore Ap also), suppose that as n — > 00, G — ► F such 
that (3) holds uniformly over closed subintcrvals / of {.s e [0,<*] : F(s) < 1} 
and sup seJ | ff (s)| < 00. Then { n - 1 / 2 S n (t), < t < t*} converges weakly in 
D[0, t*} to {Z(t) + n(t), < t < t*}, where Z(t) is the same Gaussian process 
as that in (ii) and 

^=-^-^J wmw^m^) dF{u) - 

From (ii) and (iii), the limiting Gaussian process of {n- 1/2 S n (t),t > 0} has inde- 
pendent increments under H : F = G and under contiguous alternatives. 

Previous results in the literature only treated the case F = G. In particular, 
assuming the T[ (and T", respectively) to be i.i.d., Tsiatis [63] showed that 

(rT 1 ! 2 S n {ti), ■ ■ ■ ,'nr 1 / 2 S n {tk)) has a limiting multivariate normal distribution for 
any k. Sellke and Siegmund [52] proved tightness and weak convergence in the case 
where the S n (t) are logrank statistics, and more generally where S n (t) is the score 
statistic in the proportional hazards model, without assuming the T[ (or T", £") 
to be i.i.d.. Slud [GO] considered weighted logrank statistics with weights that do 
not depend on t. 
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4. Implementation issues in sequential non-(or semi-)parametrics 

This section considers certain implementation issues in sequential nonparametrics 
and semiparametrics, particularly in the context of design and analysis of com- 
plex clinical trials with failure-time endpoints and periodic data reviews. Although 
the asymptotic joint normality of the sequential non-(or semi-) parametric statistics 
greatly simplifies their seemingly intractable distributions, the adequacy of these 
approximations may be questionable. For example, using the normal approximation 
to evaluate the power of a sequential rank test at an alternative that is assumed to 
be "contiguous" may be unreliable as it is difficult to assess whether the alternative 
is actually contiguous for the sample size associated with a prescribed stopping rule 
that allows early termination. Another important issue is related to the construc- 
tion of confidence intervals following sequential tests. The normal approximation 
can be applied if the stopped information time is asymptotically nonrandom by 
Anscombe's [2] theorem, but it is difficult to assess whether that is indeed the case 
in practice. The two time scales in time-sequential clinical trials further increase the 
difficulty. In this section we use recent developments in Monte Carlo and resampling 
methods to address these issues. 

4-1- Direct Monte Carlo and its enhancements 

The Monte Carlo simulation method provides a flexible and practical way to com- 
pute the power and expected duration of time-sequential tests, and also to check 
the adequacy of the normal approximation to the type I error probability under var- 
ious scenarios of baseline survival, censoring pattern, noncompliance, and accrual 
rate. To provide the clinical trial designer with a tool to perform these Monte Carlo 
simulations, Gu and Lai [21] developed a simulation program which gives the user 
some options for choosing the stopping boundary, including the modified Haybittlc- 
Peto-type boundary. They also incorporated this power calculation program into 
another program that computes the sample size of a group sequential trial having 
a prescribed power at given baseline and alternative distributions. Adjustment for 
other covariates in time-sequential tests of treatment effects, however, is unavail- 
able in the program developed by Gu and Lai [21]. Because of the computational 
complexity of time-sequential tests and because the Monte Carlo simulations used 
to compute power and type I error probability should not take too long to run for 
the software to be "user-friendly" , the direct Monte Carlo approach used by Gu and 
Lai [21], which is already slow to run a "bare-bones" trial design, cannot absorb 
the additional computational costs of covariate adjustment without further slowing 
it down substantially. 

Importance sampling and related exponential tilting techniques are powerful 
methods for Monte Carlo evaluation of small probabilities of events observed up 
to a stopping time. The basic idea comes from the likelihood ratio identity P(F D 
{T < oo}) = Eq(LtIfh{t '<oo}) f° r au F € Ft, where T is a stopping time and 
L n is the likelihood ratio; see Siegmund [59], page 13. For complicated problems, 
however, implementation of these importance sampling methods may be difficult 
due to difficulties of sampling from Q. Instead of sampling directly from Q, we 
can generate B sequences from a more convenient distribution Q and then use a 
resampling step to convert samples from Q to samples from Q. There are results 
from the bootstrap literature that shed light on how to carry this out; see Johns 
[30], Davison and Hinkley [13], Do and Hall [14] and Hu and Su [28]. The following 
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example illustrates this idea with the choice of resampling weights for bootstrap 
estimation of the sampling distribution of a two-sample Mann- Whitney statistic 
from unknown (F,G). 

Example 1. Denote the two independent samples by X = {Xi, . . . ,X m } and 
y = {Y\, . . . , Y n } drawn from F and G, respectively. The Mann- Whitney statistic 
is given by U = Y%Li Y%=i u (hj)> where 

f +1 if ^ > Yj, 

u(i,j) = { o xx t = Y j , 

[ -1 ifXi < Yj. 

To estimate the probability P(U < x) from X and y, the bootstrap method es- 
timates P(U* < x\X,y) by Monte Carlo simulations, where U* is the value of 
the statistic calculated from the bootstrap resamplcs X* and y* . To reduce the 
Monte Carlo variance of the bootstrap approach, we fix the first sample and rcsam- 
ple from the second sample with resampling weights pi(i = 1, . . . , n) instead of the 
usual weights 1/n corresponding to the empirical distribution of y. We can estimate 
P{U* < x\X,y) by 7T := B- 1 Y,b=i U blYLi{nPi)~ Mbi , where B is the number of 
resamples, Ub is the value of the test statistic calculated from the 6th resample, and 
Mbi denotes the number of times that Yi appears in the &th resample. The optimal 
resampling weights are those that minimize Var(7r), or equivalcntly, minimize 

n 

E{I(U* < x) Y[(n Pt )- M ' | X,y} 

i=l 

n n 

= E{I(J2 M* Ui < x) Y[(n Pi )- u ' | X, y} 

i=l i=l 
n n 

= E{I(J2 M*( Ui -u)<x-nu) Y[{n Pi )- M ' \ X,y} 

i=l i=l 
n _ _ n 

= E {i(Ym* ^~ u < / x ~ nu ) Y](n Pl r M ' I x,y} 



E{l(Y^M*Ui < x) l[(n Pi )- M ' | X,y} 

i=i i=i 

E{I(Ni < x)e N2 | X,y}, (8) 



where (TVijA^) is bivariate normal random vector with mean (0, \s 2 ) 1 variances 
1, s 2 and covariance ^ UiSi. Here M* denotes the number of times Yi appears in a 
bootstrap resample and 



"o — 

i=l 



J2u(i,j), 5i = -log{npi), s 2 =^6j 

i 

En 



Since E{I{Nx < x)e N2 \X,y} = $(i - sp)e s , where p = J2 u A/\jE u2 s and $ 
denotes the cumulative distribution function of the standard normal distribution, 
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Table 1 

Bootstrap estimates of tail probabilities of 2-sample Wilcoxon statistic and their normal 

approximations 





Direct resampling 


Importance resampling 


0.005 


0.0044 ± 0.0046 


0.0045 ± 0.0006 


0.01 


0.0096 ± 0.0078 


0.0110 ± 0.0012 


0.025 


0.0279 ± 0.0132 


0.0279 ± 0.0030 


0.05 


0.0472 ± 0.0139 


0.0464 ± 0.0040 


0.1 


0.1023 ± 0.0214 


0.0994 ± 0.0078 



the resampling weights that minimize (8) are 

e -An t 

Pi = e -Au 3 > 1 < * < ^ 

where A = A(x) > is chosen to minimize $(i — A)e A ^ . 

Table 1 gives the bootstrap estimate of P(U* < x\X,y) and the standard errors 
for the usual resampling weights 1 /n and for the optimal resampling weights given 
above. Here m = 30, n = 25, B = 500 bootstrap samples are used and F = G is 
the exponential distribution with median 3. Table 1 shows that this importance 
resampling approach yields considerably smaller standard errors than the direct 
resampling approach. Further details of importance sampling with resampling for 
Monte Carlo computation of the power and Type I error of group sequential or 
time-sequential nonparametric/semiparametric tests will be given elsewhere. 

4-2. Confidence intervals following time-sequential trials 

Although group sequential or time-sequential tests are attractive in clinical trials 
because they allow for early termination while preserving the overall significance 
level of the test and can adapt to information gathered during the course of the trial, 
the use of a stopping rule may introduce substantial difficulties in constructing valid 
confidence intervals, which has inhibited the applications of sequential methodology. 
Siegmund [58] introduced an exact method, based on ordering the sample space 
(T, St) hi the following way, to construct exact confidence intervals for the mean 
of a normal population with known variance following a repeated significance test. 
Suppose Zi has variance 1 and T is a two-sided stopping rule of the form T = 
inf{n e J : S n > b n or S n < a n }, where S n = Z\ + ■ ■ ■ + Z n and J is a finite 
set of positive integers. Siegmund ordered the sample space of (T, St) as follows: 
(t, s) > (t' , s') whenever (i) t = t' and s > s', or (ii) t < t' and s > b t , or (iii) t > t' 
and s' < af- Let fx c denote the value of fj, for which P M {(T, St) > (^s)obs} = c, 
where (t, s) b s denotes the observed value of (T, St)- Siegmund's confidence interval 
is \i a < [i < iJLi- a , which has coverage probability 1 — 2a. Tsiatis, Rosner and 
Mehta [64] extended Siegmund's method to the group sequential tests of Pocock 
[46] and O'Brien and Fleming [43]. Alternative orderings of the sample space were 
subsequently introduced by Chang and O'Brien [8], Rosner and Tsiatis [49], Chang 
[7] and Emerson and Fleming [17]. To remove the normal assumption in these exact 
methods, a natural way is to extend Efron's [16] bootstrap confidence intervals 
from the fixed sample size to the group sequential setting. Chuang and Lai [10, 11], 
however, have shown that bootstrap confidence intervals following group sequential 
tests have inaccurate coverage probabilities because the approximate pivots for a 
fixed sample size n may not remain to be approximate pivots when n is replaced 
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by a random stopping time T. By integrating the main ideas behind the exact and 
bootstrap methods, they proposed a hybrid resampling method and considered in 
particular those based on Siegmund's and other orderings for deriving the exact 
methods. They showed that hybrid resampling still works well in situations where 
the bootstrap method fails. 

To extend the hybrid resampling method to more general statistics \Pt and 
stopping rules T , e.g. those arising in time-sequential nonparametrics and semi- 
parametrics, Lai and Li [3G] recently introduced the following ordering scheme for 
the sample space of a stochastic process A„(in which u denotes either discrete or 
continuous time) that is observed up to a stopping time T. Let ^t, t < T, be real- 
valued statistics based on {X t ,t < T}. A total ordering of the sample space of X 
can be defined via ('ft, t < T) as follows: 

X > x if and only if 'J'ta* > ipT/\t, (9) 

in which ("0 S , s < t) is defined from x = (x s , s < t) in the same way as (^ s , s < t) is 
defined from X. In particular, they applied this ordering scheme to construct con- 
fidence intervals for the regression parameter fi in Cox's [12] proportional hazards 
model 

P{y <Yi<y + dy\Yi > y, zj = e^dk{y) 

with baseline cumulative hazard function A. Differentiation of the log partial likeli- 
hood at f3 = (the null hypothesis) and calendar time t yields Cox's score statistic 

&(*)=$>(*){*- ( £ ^ wil ( 10 ) 

i=l I \j£Ri(t) II ) 

where Ri(t) = {j : Yj(t) > Yi(t)} and |i?i(i)| denotes the size of the "risk set" Ri(t), 
using the same notation as that in Section 2 and assuming the censoring variables 
to be i.i.d.. The observed Fisher information at calendar time t is 



2 n 

(11) 



which provides an estimate of the null variance of S n (t). Suppose one uses a repeated 
significance test that rejects Hq at the jth interim analysis (1 < j < k) if 

Sn(i,)/K 1/2 fe) > h or S n (t 3 )/V^ 2 (t 3 ) < a h (12) 

where a 3 < < bj, and stops the trial at the first time r £ {t\, . . . , tk} when (12) 
occurs. Lai and Li [36] proposed to order the sample space of (t, ^r) by 

(ti,*«)<(t 2 ,*«) if and only if *% An <¥*\ n , (13) 

where *t = S(t)/V(t). They also proposed to estimate the unknown baseline dis- 
tribution G = 1 — e~ A by G = 1 — e~ A , where A is Breslow's [6] estimator of the 
cumulative hazard function from all the data at the end of the trial: 

£(*)= E W t ) ( E ^ ) )> ( 14 ) 
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in which (3 is Cox's [12] estimate of (3 that maximizes the partial likelihood at time 
r. Since the are censored by min{li, (r — the distribution C of & can be 

estimated by the Kaplan-Meier estimator C. Let 

p( ( 9)=P{(rW,^ ) )>(r,* T ) obs } ! (15) 

where the superscript (/?) means that the observations are generated by the pro- 
portional hazards model with baseline distribution G, censoring distribution C and 
regression parameter (3. As shown by Lai and Li [36] , the confidence set 

{/? : a < p{0) < 1 - a} (16) 

has coverage probability 1 — 2a + 0{n~ 1 / 2 ) and is usually an interval under certain 
regularity conditions. 

Lai and Li [36] used direct Monte Carlo to compute p((3), and the confidence 
interval thus constructed is computationally intensive. The reason is that a large 
number of simulations are needed to compute p(/3) for each /?, and a sample of 
survival times needs to be generated for each simulation. To reduce the computation 
time, we can use importance sampling by re-writing p(/3) as 

m = e 

in which L(-) is the full likelihood at time r. This importance sampling technique 
provides a one-pass algorithm that only needs to generate data once under P^ (after 
tilting Pp to Pg), instead of having to generate data for each possible value (3 of 
the confidence set in the direct Monte Carlo approach. For the Beta-Blocker Heart 
Attack Trial, Lai and Li [36] computed the confidence interval for the hazard ratio 
by direct Monte Carlo, which took over one day on a computer with Pentium 4 
CPU 2.4GHz and 1024MB of RAM, in contrast to about 2.5 hours to compute the 
hybrid resampling confidence interval via (17) on the same computer. 

An alternative approach that is commonly used in the literature is to use the 
space-time Brownian motion approximation of (S n (t), V n (t)) (see Jones and White- 
head [31] and Siegmund [59]) to which Siegmund's ordering can be applied. The 
following example, however, shows that the confidence intervals for (3 constructed 
by this approach may have quite inaccurate coverage errors. 

Example 2. Consider a time-sequential trial in which n = 350 subjects enter the 
trial uniformly during a 3-year recruitment period and are randomized to treatment 
or control with probability |. The trial is designed to last for a maximum oft* = 5.5 
years, with interim analyses after 1 year and every 6 months thereafter. The logrank 
statistic is used to test H : (3 = at each data monitoring time tj (j = 1, . . . , 10) 
and the test is stopped at the smallest tj such that 

V n (tj) > 55, or V n {tj) > 11 and \S n {tj)\/Vj {tj) > 2.85, (18) 

or at tio(= £*) when (18) does not occur, where V n (t) is defined by (11). If the test 

stops with V n (tj) > 55 or at t* , reject H if \S n (t*)\/V^ (t*) > 2.05. Also reject 
Ho if the second event in (18) occurs for some j < 10. The threshold 2.05 for the 
final analysis at t* is chosen so that the Type I error probability of the test is 
approximately 5% using the Brownian motion approximation; see Siegmund [59], 
page 132. When the space-time Brownian motion approximation of (S n (t),V n (t)) 



1 , 

L0) V^'^C^W 



(17) 
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is applied in conjunction with Siegmund's ordering, thereby yielding score-based 
confidence intervals, the lower and upper coverage errors for nominal value a = 
5% are 4.45% and 5.05% for /3 = 0, 4.65% and 0.35% for (3 = logf, and 5.75% 
and 0.00% for (3 = log | with 2000 simulations. This shows that the Brownian 
motion approximation docs not provide an adequate approximation unless (3 is 
very close to 0. The problem is that it fails to incorporate calendar time besides the 
information time. In contrast, Lai and Li's [36] hybrid resampling confidence set 
(16) has coverage errors 4.45% and 4.55% for (3 = 0, 5.25% and 5.35% for (3 = log §, 
and 5.05% and 4.05% for (3 = log \. 
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